Generated: /root/verbalize-sampling
Experiments: 4
Metrics: diversity, ngram, joke_quality

| Name | Task | Method | Model | Responses | Temperature |
|---|---|---|---|---|---|
| direct (samples=1) | joke | direct | anthropic/claude-3.7-sonnet | 30 | 0.7 |
| sequence [strict] (samples=5) | joke | sequence | anthropic/claude-3.7-sonnet | 30 | 0.7 |
| multi_turn [strict] (samples=5) | joke | multi_turn | anthropic/claude-3.7-sonnet | 30 | 0.7 |
| structure_with_prob [strict] (samples=5) | joke | structure_with_prob | anthropic/claude-3.7-sonnet | 30 | 0.7 |

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Did you hear about the mathematician who's afraid of negative numbers? He'll stop at nothing to avoid them!

Prompt: Tell me a joke.
Response: What do you call a fake noodle? An impasta!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Prompt: Tell me a joke.
Response: Why don't scientists trust atoms? Because they make up everything!

Metric: diversity

| Experiment | Average Diversity | Min Diversity | Max Diversity | Std Diversity | Average Response Length | Average Unique Words | Average Vocabulary Richness | Total Cost |
|---|---|---|---|---|---|---|---|---|
| direct (samples=1) | 0.1077 | 0.0000 | 0.4850 | 0.1519 | 12.4533 | 12.2267 | 0.9907 | 0.0000 |
| sequence [strict] (samples=5) | 0.3170 | 0.0000 | 0.4849 | 0.1340 | 13.9800 | 13.4533 | 0.9723 | 0.0001 |
| multi_turn [strict] (samples=5) | 0.2853 | 0.0000 | 0.4922 | 0.1226 | 13.9000 | 13.3933 | 0.9726 | 0.0001 |
| structure_with_prob [strict] (samples=5) | 0.3231 | 0.0000 | 0.4955 | 0.1334 | 14.0867 | 13.6000 | 0.9747 | 0.0001 |
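
The report does not state how Average Unique Words and Average Vocabulary Richness are computed. One common reading is a per-response type-token ratio (unique words divided by total words), averaged over all responses. The sketch below assumes that definition and plain whitespace tokenization; the names `vocabulary_richness` and `summarize` are hypothetical, and the report's own tokenizer may well differ.

```python
from statistics import mean


def vocabulary_richness(text: str) -> float:
    """Type-token ratio: unique words / total words (assumed definition)."""
    words = text.lower().split()  # assumption: whitespace tokenization
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def summarize(responses: list[str]) -> dict[str, float]:
    """Average length, unique-word count, and richness over non-empty input."""
    token_lists = [r.lower().split() for r in responses]
    return {
        "avg_length": mean(len(t) for t in token_lists),
        "avg_unique": mean(len(set(t)) for t in token_lists),
        "avg_richness": mean(vocabulary_richness(r) for r in responses),
    }


# Two responses taken from the sample output in this report.
stats = summarize([
    "Why don't scientists trust atoms? Because they make up everything!",
    "What do you call a fake noodle? An impasta!",
])
print(stats)
```

If richness is indeed averaged per response, that may explain why the reported value for `direct` (0.9907) differs from the ratio of the column averages (12.2267 / 12.4533 ≈ 0.9818).
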

Metric: ngram

| Experiment | Average Rouge L | Min Rouge L | Max Rouge L | Std Rouge L | Average Response Length |
|---|---|---|---|---|---|
| direct (samples=1) | 0.7010 | 0.0000 | 1.0000 | 0.3855 | 12.4533 |
| sequence [strict] (samples=5) | 0.1720 | 0.0000 | 1.0000 | 0.2982 | 13.9800 |
| multi_turn [strict] (samples=5) | 0.2178 | 0.0000 | 1.0000 | 0.2771 | 13.9000 |
| structure_with_prob [strict] (samples=5) | 0.1695 | 0.0000 | 1.0000 | 0.2909 | 14.0867 |
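
The Rouge L columns appear to measure word-level overlap between responses, so a higher average means more repetition. The report does not say which ROUGE-L variant or pairing scheme it uses. As a minimal sketch, the code below assumes the F1 form of ROUGE-L over the longest common subsequence of whitespace tokens, averaged over all unordered response pairs; the function names are hypothetical.

```python
from itertools import combinations


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (assumed variant)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_rouge_l(responses: list[str]) -> float:
    """Average ROUGE-L F1 over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)


atoms = "Why don't scientists trust atoms? Because they make up everything!"
noodle = "What do you call a fake noodle? An impasta!"
print(mean_pairwise_rouge_l([atoms, atoms, noodle]))
```

Under this reading, identical responses score 1.0 and responses with no shared words score 0.0, which matches the Min and Max columns above.
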

Metric: joke_quality

| Experiment | Funniness | Cleverness | Originality | Structure | Overall | Normalized Overall |
|---|---|---|---|---|---|---|
| direct (samples=1) | 3.5733 | 3.7200 | 2.2267 | 4.2733 | 3.4457 | 0.6891 |
| sequence [strict] (samples=5) | 3.4133 | 3.7467 | 2.5600 | 4.2267 | 3.4480 | 0.6896 |
| multi_turn [strict] (samples=5) | 3.3600 | 3.7400 | 2.4067 | 4.1467 | 3.3823 | 0.6765 |
| structure_with_prob [strict] (samples=5) | 3.4067 | 3.7600 | 2.4267 | 4.1600 | 3.4120 | 0.6824 |
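
The Normalized Overall values are consistent with dividing Overall by a maximum score of 5 (for example, 3.4457 / 5 ≈ 0.6891). The report does not state the rating scale, so the 5-point maximum below is an assumption inferred from the numbers, and `normalize` is a hypothetical helper.

```python
MAX_SCORE = 5.0  # assumption: ratings use a 5-point maximum

# Overall and Normalized Overall values copied from the table above.
reported = {
    "direct": (3.4457, 0.6891),
    "sequence": (3.4480, 0.6896),
    "multi_turn": (3.3823, 0.6765),
    "structure_with_prob": (3.4120, 0.6824),
}


def normalize(score: float, max_score: float = MAX_SCORE) -> float:
    """Scale a raw score into [0, 1], rounded to four decimals like the report."""
    return round(score / max_score, 4)


for name, (overall, normalized) in reported.items():
    matches = abs(normalize(overall) - normalized) < 1e-9
    print(f"{name}: {normalize(overall):.4f} (matches report: {matches})")
```
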